{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting Started\n",
    "\n",
    "### Basic supervised training and testing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "┌ Info: Recompiling stale cache file /Users/anthony/.julia/compiled/v1.0/MLJ/rAU56.ji for MLJ [add582a8-e3ab-11e8-2d5e-e98b27df1bc7]\n",
      "└ @ Base loading.jl:1190\n"
     ]
    }
   ],
   "source": [
    "using Revise\n",
    "using MLJ\n",
    "using RDatasets\n",
    "\n",
    "iris = dataset(\"datasets\", \"iris\"); # a DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In MLJ one can either wrap data for supervised learning in a formal *task* (see [Working with Tasks](tasks.jl)), or work directly with the data, split into its input and target parts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "const X = iris[:, 1:4];\n",
    "const y = iris[:, 5];"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A *model* is a container for hyperparameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "import MLJModels ✔\n",
      "import DecisionTree ✔\n",
      "import MLJModels.DecisionTree_.DecisionTreeClassifier ✔\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "# \u001b[0m\u001b[1mDecisionTreeClassifier{String} @ 2…98\u001b[22m: \n",
       "target_type             =>   String\n",
       "pruning_purity          =>   1.0\n",
       "max_depth               =>   2\n",
       "min_samples_leaf        =>   1\n",
       "min_samples_split       =>   2\n",
       "min_purity_increase     =>   0.0\n",
       "n_subfeatures           =>   0.0\n",
       "display_depth           =>   5\n",
       "post_prune              =>   false\n",
       "merge_purity_threshold  =>   0.9\n",
       "\n"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@load DecisionTreeClassifier\n",
    "tree_model = DecisionTreeClassifier(target_type=String, max_depth=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wrapping the model in data creates a machine which will store training outcomes (called fit-results):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "# \u001b[0m\u001b[1mMachine{DecisionTreeClassifier{S…} @ 1…36\u001b[22m: \n",
       "model                   =>   \u001b[0m\u001b[1mDecisionTreeClassifier{String} @ 2…98\u001b[22m\n",
       "fitresult               =>   (undefined)\n",
       "cache                   =>   (undefined)\n",
       "args                    =>   (omitted Tuple{DataFrame,CategoricalArray{String,1,UInt8,String,CategoricalString{UInt8},Union{}}} of length 2)\n",
       "report                  =>   empty Dict{Symbol,Any}\n",
       "rows                    =>   (undefined)\n",
       "\n"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tree = machine(tree_model, X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Training and testing on a holdout set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "┌ Info: Training \u001b[0m\u001b[1mMachine{DecisionTreeClassifier{S…} @ 1…36\u001b[22m.\n",
      "└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/machines.jl:68\n"
     ]
    }
   ],
   "source": [
    "train, test = partition(eachindex(y), 0.7, shuffle=true); # 70:30 split\n",
    "fit!(tree, rows=train)\n",
    "yhat = predict(tree, X[test,:]);\n",
    "misclassification_rate(yhat, y[test]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or, in one line:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.08888888888888889"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluate!(tree, resampling=Holdout(fraction_train=0.7, shuffle=true), measure=misclassification_rate)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Changing a hyperparameter and re-evaluating:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.06666666666666667"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tree_model.max_depth = 3\n",
    "evaluate!(tree, resampling=Holdout(fraction_train=0.5, shuffle=true), measure=misclassification_rate)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prerequisites\n",
    "\n",
    "MLJ assumes some familiarity with the [CategoricalArrays.jl]() package, used for respresenting categorical data. For probablilistic predictors, a basic acquaintance with [Distributions.jl]() is also assumed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data\n",
    "\n",
    "While MLJ is data *container* agnostic it is very fussy about *element* types. Every user needs to understand the basic assumptions about the form of data expected by MLJ, as outlined below.\n",
    "\n",
    "On the one hand, MLJ is quite flexible about how tabular data is presented: Anywhere a table is expected (eg, `X` above) any tabular format supporting the [Tables.jl]() API is allowed. A single features (such as the target `y` above) is expected to be a `Vector` or `CategoricalVector`, according to the scientific type of the data (see below). A multivariate target can be any table. \n",
    "\n",
    "On the other hand, the *element types* you use to represent your data has implicit consequences about how MLJ will interpret that data. \n",
    "\n",
    "To articulate MLJ's conventions about data representation, it is useful to distinguish between *machine* data types on the one\n",
    "hand (`Float64`, `Bool`, `String`, etc) , and *scientific* data types\n",
    "on the other, represented in MLJ by new Juilia types: `Continuous`, `Discrete`, `Multiclass{N}`, `FiniteOrderedFactor{N}`, and `Count`, having obvious interpretions. These types, which are organized in a type heirarchy (see [Scientific Data Types](scientific_data_types.md)), are used by MLJ for dispatch (but have no corresponding instances).\n",
    "\n",
    "Scientific types appear when querying model metadata, as in this example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Multiclass"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "info(\"DecisionTreeClassifier\")[:target_scitype]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Basic data convention** The scientific type of data that a Julia object `x` can represent is defined by `scitype(x)`. If `scitype(x) == Other`, then `x` cannot represent scalar data in MLJ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(Count, Continuous, MLJBase.Other)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(scitype(42), scitype(π), scitype(String))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In particular, *integers cannot be used to represent `Multiclass` or `FiniteOrderedFactor` data*; these can represented by an unordered or ordered `CategoricalValue` or `CategoricalString`:\n",
    "\n",
    "`T`                     |     `scitype(x)` for `x::T`\n",
    "------------------------|:--------------------------------\n",
    "`Missing`                 |      `Missing`\n",
    "`Real`                    |      `Continous`\n",
    "`Integer`                |        `Count`\n",
    "`CatetoricalValue`       | `Multiclass{N}` where `N = nlevels(x)`, provided `x.pool.ordered == false` \n",
    "`CatetoricalString`       | `Multiclass{N}` where `N = nlevels(x)`, provided `x.pool.ordered == false`\n",
    "`CatetoricalValue`       | `FiniteOrderedFactor{N}` where `N = nlevels(x)`, provided `x.pool.ordered == true` \n",
    "`CatetoricalString`       | `FiniteOrderedFactor{N}` where `N = nlevels(x)` provided `x.pool.ordered == true`\n",
    "`Integer`                 | `Count`\n",
    "\n",
    "Here `nlevels(x) = length(levels(x.pool))`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Next steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "See the MLJ [tour](tour.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 1.0.2",
   "language": "julia",
   "name": "julia-1.0"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.0.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
